Abstract

Background: Next-generation sequencing (NGS) is now a commonplace tool for molecular characterisation of 
virtually any species of interest. Despite the ever-increasing use of NGS in laboratories worldwide, analysis of whole 
genome re-sequencing (WGS) datasets from start to finish remains nontrivial due to the fragmented nature of NGS 
software and the lack of experienced bioinformaticists in many research teams. 

Findings: We describe SPANDx (Synergised Pipeline for Analysis of NGS Data in Linux), a new tool for high-throughput
comparative analysis of haploid WGS datasets comprising one through thousands of genomes. SPANDx consolidates 
several well-validated, open-source packages into a single tool, mitigating the need to learn and manipulate individual 
NGS programs. SPANDx incorporates BWA for alignment of raw NGS reads against a reference genome or 
pan-genome, followed by data filtering, variant calling and annotation using Picard, GATK, SAMtools and SnpEff.
BEDTools has also been included for genetic locus presence/absence (P/A) determination to easily visualise the core 
and accessory genomes. Additional SPANDx features include construction of error-corrected single-nucleotide 
polymorphism (SNP) and insertion-deletion matrices, and P/A matrices, to enable user-friendly visualisation of genetic 
variants. The SNP matrices generated using VCFtools and GATK are directly importable into PAUP*, PHYLIP or RAxML
for downstream phylogenetic analysis. SPANDx has been developed to handle NGS data from Illumina, Ion Personal
Genome Machine (PGM) and 454 platforms, and we demonstrate that it has comparable performance across Illumina
MiSeq/HiSeq2000 and Ion PGM data.

Conclusion: SPANDx is an all-in-one tool for comprehensive haploid WGS analysis. SPANDx is open source and is freely 
available at: http://sourceforge.net/projects/spandx/. 

Background

The development of the first massively parallel next- 
generation sequencing (NGS) platform in 2005 [1] 
forever changed the medical and biological research 
landscape. A decade on, NGS technologies are now 
being routinely used for myriad purposes including 
whole-genome re-sequencing (WGS), genome-wide as- 
sociation studies, de novo- and re-assemblies, amplicon 
re-sequencing, polymorphism discovery, non-coding 
and coding RNA characterisation (RNA-seq), methyla- 
tion studies (Methyl-seq) and protein-DNA interactions 
(ChIP-seq). The popularity of NGS has led to a rapid decrease
in operating and reagent costs that has outstripped
the "Moore's law" paradigm, a common yardstick
for measuring technological success based on computa- 
tional hardware speed (http://www.genome.gov/sequen- 
cingcosts/). This plummeting cost has been brought about 
by major technological improvements and increased com- 
petition in the NGS platform market. Given continuing 
improvements in cost-effectiveness and versatility of NGS 
in molecular biology research, it is not surprising that 
NGS has become a mainstay in both small and large re- 
search laboratories across the globe. 

The desire to answer important medical or biological 
questions using NGS, and in particular WGS, has con- 
currently driven the development of analysis tools 
designed to efficiently and accurately decode these vast
volumes of nucleic acid data. However, analysis has been 
unable to keep pace with the volume of data being gen- 
erated. Challenges to NGS data management and ana- 
lysis include computation and storage availability and 
scalability, data sharing and privacy issues, NGS software 
costs and the requirement for bioinformaticists skilled in 
designing, programming and running complex analysis 
pipelines [2]. The technical difficulty and fragmented na- 
ture of NGS software, particularly for large-scale WGS 
analyses involving more than a handful of genomes, 
mean that comprehensive analyses remain out of reach 
for many researchers. In addition, the lack of transpar- 
ent, publicly available and standardised NGS pipelines 
has potentially led to non-validated variant outputs be- 
ing reported and perpetuated in the literature. 

To address these issues, we have developed SPANDx 
(Synergised Pipeline for Analysis of NGS Data in Linux).
SPANDx is an open-source, high-throughput, compara- 
tive genomic analysis tool for haploid organisms that in- 
tegrates well-validated, open-source programs into a 
single program, thereby simplifying and standardising te- 
dious WGS analysis workflows. SPANDx incorporates 
Burrows-Wheeler Aligner (BWA) [3,4] for read mapping 
alignment, SAMtools [5] for read filtering and parsing, 
BEDTools [6] for genetic locus presence/absence (P/A) 
determination, Picard (http://picard.sourceforge.net) for 
data filtering, the Genome Analysis Tool Kit (GATK) 
[7,8] for base quality score recalibration, variant deter- 
mination, data filtering and improved insertion-deletion 
(indel) calling, VCFtools [9] for single-nucleotide poly- 
morphism (SNP) and indel matrix construction, and 
SnpEff [10] for variant annotation. SPANDx has been 
written to analyse data generated from paired- and
single-end Illumina (both pre- and post-v1.8 quality encoding)
platforms, as well as Ion PGM and 454 single-end
data.

SPANDx also incorporates several additional features 
aimed at minimising researcher hands-on time whilst en- 
abling customisability. Most notably, SPANDx automatic- 
ally generates a human-readable P/A matrix from 
individual BEDTools outputs, and can also construct 
error-corrected SNP and indel matrices when specified. 
These outputs enable quick and facile visualisation of gen- 
etic variants across a large number of genomes. SNP 
matrices generated by SPANDx are provided in .nex for- 
mat and are directly importable into PAUP*, PHYLIP or 
RAxML for downstream phylogenetic analysis. Inbuilt, 
pre-optimised and customisable variant calling parameters 
for Illumina and Ion PGM data obviate the need for time- 
consuming optimisation of these settings, a requirement 
of other programs (e.g. Galaxy [11]). Unlike many WGS 
tools, SPANDx does not require the user to provide assembled
genomes for every strain. SPANDx is run with a
single command and parallelises many tasks by taking advantage
of Portable Batch System (PBS) job scheduling,
thereby reducing processing times for large datasets com- 
prising tens through to thousands of genomes. Finally, 
SPANDx has been written in relatively simple, non- 
compiled, open-source code that enables users to custom- 
ise the program by incorporating their preferred NGS 
tools (e.g. Bowtie [12] instead of BWA for read align- 
ment), or by adding new features to its workflow. 

Findings 

SPANDx description 

The SPANDx workflow is shown in Figure 1. SPANDx is 
a shell package written for implementation in a Linux en- 
vironment using Bash. SPANDx integrates multiple freely 
available Linux-based programs (BWA [3,4], SAMtools
[5], Picard, GATK [7,8], VCFtools [9], BEDTools [6] and 
SnpEff [10]) into a single pipeline for alignment, variant 
identification, analysis and annotation from raw NGS data 
derived from haploid organisms. Using data generated 
from our prior WGS studies [13-16], we have tested the 
performance of SPANDx using paired-end Illumina 
(GAIIx, MiSeq and HiSeq2000) data, and single-end Ion
PGM, Illumina, and 454 GS-FLX/FLX+ data. SPANDx is 
designed to run in a cluster environment and utilises par- 
allel processing for the majority of the analysis pipeline. 
To facilitate parallelisation and appropriate resource allo- 
cation, SPANDx requires a Linux/UNIX system with PBS. 
The SPANDx user manual (available at: http://source- 
forge.net/projects/spandx/) provides detailed information 
on installing, operating and, where desired, customising
this program. 

Variant identification and phylogenetic analysis 

Variant (i.e. SNP and indel) identification is a fundamental 
component of any haploid WGS analysis. For this study, 
default settings for SPANDx (as detailed in the user man- 
ual) were used to identify variants; optional settings were 
included as follows. The -m flag was used to construct core 
genome SNP matrices from the individual Escherichia coli 
or Haemophilus influenzae SNP .vcf files for phylogenetic 
reconstruction. The SPANDx-generated Ortho_SNP_matrix.nex
file was directly imported into PAUP* 4.0b10 [17]
and used to construct maximum parsimony phylogenetic
trees (Figures 2, 3 and 4). For the seven REL E. coli genomes,
SnpEff was implemented (using the -a and -v flags)
to annotate SNPs. 
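
To illustrate how a core genome SNP matrix can be assembled from per-genome variant calls, the following Python sketch merges calls into one allele string per taxon. It is not taken from the SPANDx source code: the function name build_core_snp_matrix, the in-memory dictionaries standing in for parsed .vcf files, and the rule that a site is dropped when any genome lacks a confident call are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for a parsed .vcf file: reference position
# mapped to the called base. None marks a position that failed filtering
# (for example low coverage or an ambiguous genotype).
Calls = Dict[int, Optional[str]]


def build_core_snp_matrix(
    reference: Dict[int, str],
    samples: Dict[str, Calls],
) -> Tuple[List[int], Dict[str, str]]:
    """Return the core variable positions and one allele string per taxon."""
    candidates = sorted({pos for calls in samples.values() for pos in calls})
    core_positions = []
    for pos in candidates:
        # A genome with no entry at this position is assumed to match the reference.
        alleles = [calls.get(pos, reference[pos]) for calls in samples.values()]
        # Drop the site if any genome lacks a confident call there.
        if any(allele is None for allele in alleles):
            continue
        # Keep only sites that vary across the dataset, reference included.
        if len(set(alleles) | {reference[pos]}) > 1:
            core_positions.append(pos)

    matrix = {"Reference": "".join(reference[p] for p in core_positions)}
    for name, calls in samples.items():
        matrix[name] = "".join(calls.get(p, reference[p]) for p in core_positions)
    return core_positions, matrix


if __name__ == "__main__":
    ref = {10: "A", 20: "C", 30: "G"}
    calls = {"s1": {10: "T", 30: None}, "s2": {20: "C", 30: "A"}}
    print(build_core_snp_matrix(ref, calls))
```

In this toy input, position 30 is discarded because one genome has no confident call there, and position 20 is discarded because no genome differs from the reference.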

Figure 1 SPANDx workflow for analysis of haploid next-generation re-sequencing data.

Presence/absence (P/A) analysis of E. coli and H. influenzae genomes

Defining the core (i.e. loci present in all taxa) vs.
accessory (i.e. loci present in at least one taxon) genome
is another fundamental application of haploid WGS.
This information can be used for many purposes
including pan-genome construction, strain-, species- or
genus-level signature identification, or for observing 
patterns of genome reduction. The coverageBED 
module of BEDTools has been incorporated into the 
SPANDx pipeline for this purpose. BEDTools deter- 
mines NGS read coverage depth and breadth across seg- 
ments or 'bins' relative to the reference genome [6], 
thereby providing an efficient way of using raw NGS 
reads to identify both core and accessory genomic loci 
within a dataset compared with a reference genome. For 
the current study, a default 1 kb window size was used 
for P/A analysis. SPANDx automatically generates
coverageBED genetic locus P/A outputs from all inputted
genomes against the reference genome and combines
individual outputs into a single human-readable
matrix file (Bedcov_merge.txt). Additional file manipu- 
lation of P/A matrices was performed using basic fea- 
tures in MS Excel 2010 to create heat maps. 
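
The merging step described above can be sketched in Python as follows. This is a minimal illustration rather than the SPANDx implementation: the function names, the dictionary layout for per-genome window coverage, and the 0.95 breadth cut-off used to call a window present are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# A reference bin, e.g. one 1 kb window: (contig, start, end).
Window = Tuple[str, int, int]


def merge_coverage(
    per_genome: Dict[str, Dict[Window, float]],
) -> Tuple[List[str], Dict[Window, List[float]]]:
    """Combine per-genome coverage breadth (0.0 to 1.0) into one matrix.

    Rows are reference windows and columns are genomes in sorted order.
    A window missing from a genome's output is treated as uncovered.
    """
    genomes = sorted(per_genome)
    windows = sorted({w for cov in per_genome.values() for w in cov})
    matrix = {w: [per_genome[g].get(w, 0.0) for g in genomes] for w in windows}
    return genomes, matrix


def classify_windows(
    matrix: Dict[Window, List[float]],
    present_at: float = 0.95,  # assumed cut-off, not a documented default
) -> Dict[Window, str]:
    """Label a window 'core' if every genome covers it, else 'accessory'."""
    return {
        window: "core" if all(v >= present_at for v in values) else "accessory"
        for window, values in matrix.items()
    }


if __name__ == "__main__":
    coverage = {
        "strainA": {("chr", 0, 1000): 1.0, ("chr", 1000, 2000): 0.2},
        "strainB": {("chr", 0, 1000): 0.97},
    }
    genomes, matrix = merge_coverage(coverage)
    print(genomes)
    print(classify_windows(matrix))
```

Keeping the numeric breadth values in the matrix, rather than reducing them to present or absent at merge time, leaves the data suitable for heat-map display.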

Example P/A matrices generated by SPANDx for the 
E. coli and H. influenzae datasets, which highlight the
core and accessory genomes of these species compared 
with the reference genome, are respectively shown in 
Figures 3 and 4. We have previously used these outputs
to develop a novel speciation target for Burkholderia
ubonensis [13] and to characterise genome reduction in
Burkholderia pseudomallei [14].

Figure 2 Single-nucleotide polymorphism (SNP) variants identified by SPANDx across the genomes of seven clonal long-term E. coli
in vitro passaged cultures. The number of generations is indicated in parentheses. REL606, the ancestor for these passaged cultures, was used
for reference genome comparison [GenBank:NC_012967]. As confirmed by SPANDx, REL607 is known to differ from REL606 by two SNPs [18], as
denoted by the red vertical lines. In contrast, the ~40 K strain REL10938 is a hypermutable strain [19] and SPANDx identified 607 SNPs separating
REL10938 from REL606. Phylogenetic analysis was performed using the Ortho_SNP_matrix.nex file, an output from SPANDx that can be directly
imported into PAUP* 4.0 [17]. Using maximum parsimony, a highly accurate tree (consistency index = 1.0) was generated in PAUP*. SNPs were
visualised with Integrative Genomics Viewer v2.3.25 [20].

Optimised variant calling for Illumina and Ion PGM data

Other NGS-based genomics tools such as Galaxy [11] require
users to specify variant calling settings, which can be
a subjective and time-consuming task, particularly for users
unfamiliar with NGS data. To combat this issue, SPANDx
includes pre-optimised variant calling for both single- and
paired-end haploid NGS data across the Illumina and Ion
PGM platforms. Although these settings have been optimised
using our test datasets, they can be customised if desired
by altering the filtering parameters in GATK.config, a
file that comes with the SPANDx distribution.

Figure 3 Core single-nucleotide polymorphism (SNP) phylogenetic analysis across 16 E. coli genomes (left), and comparison with the
accessory genome (right). The Ortho_SNP_matrix.nex file created by SPANDx was directly imported into PAUP* 4.0 and used for phylogenetic
construction based on 106,557 core SNPs. Using maximum parsimony, a tree with a consistency index of 0.78 was generated. The Bedcov_merge.txt
file for presence/absence analysis of loci was automatically generated by SPANDx using the coverageBED module of BEDTools [6], based on
the default 1 kb window size. Regions with <95% coverage across one or more genomes are displayed, representing ~1.6 Mbp of the E. coli
genome (x-axis). Coverage is shown as a heat map, with red lines equating to low or no coverage through to green lines, which represent
uniform coverage at each 1 kb window. In combination, these tools enable facile visualisation of the core and accessory haploid genomes.

Figure 4 Comparison of Illumina and Ion PGM platforms using SPANDx. SPANDx was tested on 19 Australian Haemophilus influenzae strains [16]
with both single-end Ion PGM and paired-end Illumina data. Strain 86-028NP [21] was used for reference alignment. From the Illumina data (top left),
~161,000 identified SNPs were used to construct a core genome SNP cladogram (CI = 0.47). From the Ion PGM data (bottom left), ~129,000 identified
SNPs were used to construct a core genome SNP cladogram (CI = 0.48). The right-hand side panels show corresponding presence/absence data for
each strain as described in Figure 3. For Illumina, 621 kb was found to be variable, compared with 624 kb with the Ion PGM data. Collectively, this
comparison shows that SPANDx provides highly consistent haploid comparative genomic outputs across multiple NGS platforms.



Phylogenetic analysis of SPANDx SNP outputs 

Using a combination of VCFtools, GATK and several
quality control and filtering steps, SPANDx automatically
generates error-corrected core genome SNP matrices
for phylogenetic analysis that can be directly
imported into the phylogenetic programs PAUP*, PHYLIP
and RAxML, the latter two of which are open-source
software. The extensive error checking, filtering
and variant identification steps undertaken in SPANDx
using GATK ensure that the identified SNPs are as accurate
as possible using NGS data. Example maximum
parsimony analyses of SPANDx-generated data for the
E. coli and H. influenzae datasets are shown in Figures 2,
3 and 4.
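
For readers unfamiliar with the NEXUS format, the Python sketch below shows how a SNP matrix of one allele string per taxon can be written as a NEXUS DATA block. The function name to_nexus and the exact block layout are assumptions for illustration. The Ortho_SNP_matrix.nex file produced by SPANDx may be laid out differently.

```python
from typing import Dict


def to_nexus(matrix: Dict[str, str]) -> str:
    """Serialise aligned allele strings (one per taxon) as a NEXUS DATA block.

    Taxon names are written as given, so names containing spaces or
    punctuation would need quoting before use.
    """
    lengths = {len(seq) for seq in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all taxa must have the same number of characters")
    nchar = lengths.pop()

    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    for taxon, seq in matrix.items():
        lines.append(f"    {taxon} {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_nexus({"REL606": "AC", "REL607": "AT"}))
```

Checking that every taxon has the same number of characters before writing catches a ragged matrix early, which phylogenetic programs would otherwise reject at import.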

Program comparison: SPANDx vs. BRESEQ

We performed an in-depth comparison of SPANDx with 
BRESEQ, a comparative genomics tool specifically designed 
for identifying SNPs, indels and large deletions in closely related
microbial-sized genomes (http://barricklab.org/twiki/bin/view/Lab/ToolsBacterialGenomeResequencing). Due to
limitations on BRESEQ data inputs, we only compared the 
six closely related E. coli REL genomes spanning 2 K to 
40 K generations; REL607 was not included in the BRESEQ 
study [19] and was therefore excluded in this comparison. 
Default settings for SPANDx (as detailed in the SPANDx 
user manual) were used to identify variants. SnpEff was im- 
plemented using the -a and -v flags in SPANDx to annotate 
SNPs. 

SNPs 

SPANDx and BRESEQ identified identical SNPs for the 
2 K, 5 K, 10 K, 15 K and 20 K mutants (3, 9, 16, 22 and 
28 SNPs, respectively) [19]. One additional SNP in the 
20 K strain, located at position 2129116 of insB-lS in 
REL606, was not identified by either SPANDx or BRE- 
SEQ and was only discovered by Sanger sequencing [19]. 
This SNP was not able to be identified from NGS read 
data due to the paralogous nature of the IS1 insertion
sequence element in this genome. BLAST analysis of IS1
in REL606 identified 27 highly related copies (>99%
match across 100% of bases), with up to three SNPs
present among the paralogues. Using NGS data, especially
data harbouring relatively small insert sizes (~80-170 bp
with this dataset), such loci cannot be accurately
mapped. Therefore, the exclusion of this SNP from both 
the SPANDx and BRESEQ pipelines demonstrates the 
inherent limitations of using short-read NGS data for 
variant calling in large paralogous loci. 

SPANDx analysis of the 40 K strain identified only 608
SNPs separating the hypermutable strain REL10938 from
its REL606 ancestor, compared with the 626 SNPs found
using BRESEQ. Closer examination found that these 18
SNPs were either not identified by SPANDx or were excluded
using the default filtering parameters due to non-polymorphic
(n = 1) or ambiguous (n = 11) genotypes, or
poor mapping quality and/or insufficient (<0.5x of average)
coverage (n = 6). The default parameters for SNP calling
in SPANDx have been optimised such that the ability
to identify only 'real' variants is maximised; false-positives
are not tolerated with these settings, in line with GATK
recommendations. Loosening of these parameters results
in additional SNPs being identified, some of which may
turn out to be 'real' upon confirmation with e.g. Sanger
sequencing; however, the trade-off is that false-positives
begin plaguing the dataset (results not shown). Given
the nature of NGS data and the behaviour of NGS alignment
programs, neither variant calling method is incorrect
per se, but these minor differences between programs
highlight the need to verify questionable SNPs from NGS
data using secondary methods including manual inspection
of NGS read alignments in e.g. Tablet [22], or wet
laboratory-based analyses such as Sanger sequencing or
allele-specific PCR.
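
The exclusion categories reported above can be expressed as a simple filter. The Python sketch below is illustrative only and does not reproduce the GATK-based filtering used by SPANDx: the SnpCall record and the function rejection_reason are hypothetical, and the mapping quality threshold of 30 is an assumed value. Only the coverage cut-off of 0.5x the average depth comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

UNAMBIGUOUS_BASES = {"A", "C", "G", "T"}


@dataclass
class SnpCall:
    """Hypothetical record for one candidate SNP in one genome."""
    position: int
    ref: str
    alt: str
    depth: int
    mapping_quality: float


def rejection_reason(
    call: SnpCall,
    mean_depth: float,
    min_depth_ratio: float = 0.5,        # <0.5x of average coverage, as in the text
    min_mapping_quality: float = 30.0,   # assumed threshold for illustration
) -> Optional[str]:
    """Return why a call would be excluded, or None if it passes."""
    if call.alt == call.ref:
        return "non-polymorphic"
    if call.alt not in UNAMBIGUOUS_BASES:
        # e.g. an IUPAC ambiguity code standing in for a mixed genotype
        return "ambiguous genotype"
    if call.mapping_quality < min_mapping_quality:
        return "poor mapping quality"
    if call.depth < min_depth_ratio * mean_depth:
        return "insufficient coverage"
    return None


if __name__ == "__main__":
    candidate = SnpCall(position=4, ref="A", alt="G", depth=20, mapping_quality=60.0)
    print(rejection_reason(candidate, mean_depth=50.0))
```

Returning the reason rather than a bare pass or fail makes it straightforward to tabulate how many calls fall into each exclusion category, as was done for the 18 discordant SNPs.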

Indels and chromosomal rearrangements 

Comparison of SPANDx and BRESEQ for identifying
small (<20 bp) indels in the REL strain cohort demonstrated
that both methods gave identical results (variants are detailed
in Supplementary Table two from [19]). Neither
method identified a known 1.49 Mbp inversion [23].

Large deletions and insertions 

Large insertions are not currently able to be detected 
using SPANDx. However, for highly related strains these 
signatures can be detected with BRESEQ, as exemplified 
by the identification of ten IS element insertions with 
BRESEQ that were not found by SPANDx. Identification 
of large deletions (>20 bp) showed that, on a gross level, 
there was good consistency between the programs. How- 
ever, the size of the deletions varied between SPANDx 
and BRESEQ, with SPANDx overestimating deletion size 
for three of the five identified deletions by ~0.7 to
1.4 kb. BLAST analysis of these regions showed that the
additional sequence called as 'deleted' by SPANDx corre- 
sponded with paralogous IS element loci (results not 
shown). This finding was expected, being consistent with 
inherent read mapping difficulties across paralogous loci 
using short-read NGS data. 
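
Because coverage is assessed per window, deletion boundaries inferred from it can only be as precise as the window size, and flanking windows that cannot be mapped (such as paralogous IS elements) are absorbed into the call. The Python sketch below illustrates this with a minimal merge of consecutive low-coverage windows into candidate deletion intervals. The function name candidate_deletions and the 0.05 breadth threshold are assumptions for the example, not SPANDx settings.

```python
from typing import List, Tuple

# One reference window on a single contig: (start, end, coverage breadth 0.0 to 1.0).
CoverageWindow = Tuple[int, int, float]


def candidate_deletions(
    windows: List[CoverageWindow],
    absent_below: float = 0.05,  # assumed threshold for calling a window absent
) -> List[Tuple[int, int]]:
    """Merge runs of adjacent low-coverage windows into (start, end) intervals."""
    deletions: List[Tuple[int, int]] = []
    run_start = run_end = None
    for start, end, breadth in sorted(windows):
        if breadth < absent_below:
            if run_start is not None and start <= run_end:
                # Window touches the current run, so extend it.
                run_end = max(run_end, end)
            else:
                if run_start is not None:
                    deletions.append((run_start, run_end))
                run_start, run_end = start, end
        elif run_start is not None:
            # A covered window closes the current run.
            deletions.append((run_start, run_end))
            run_start = run_end = None
    if run_start is not None:
        deletions.append((run_start, run_end))
    return deletions


if __name__ == "__main__":
    bins = [
        (0, 1000, 1.0),
        (1000, 2000, 0.0),
        (2000, 3000, 0.01),
        (3000, 4000, 0.9),
        (5000, 6000, 0.0),
    ]
    print(candidate_deletions(bins))
```

An interval produced this way always snaps to window edges, which is one reason read-level inspection or a genome assembly is needed to place deletion breakpoints exactly.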

Program comparison: SPANDx vs. Galaxy 

Although we did not directly test Galaxy in this study, a 
previous study has used this program to compare E. coli 
strain REL607 with REL606 [18]. SPANDx identified that 
REL607 is a dual-nucleotide variant of REL606 at the
araA and recD loci (Figure 2); no indels were found by 
either program. Thus, SPANDx confirmed previous vari- 
ant findings identified using Galaxy [18]. 

Cross-platform reproducibility of SPANDx 

The performance of bioinformatics tools across multiple 
NGS platforms is an important consideration for analysis 
reproducibility and program utility. To address this question,
we tested the performance of SPANDx using 19 H.
influenzae strains subjected to two different NGS platforms:
single-end Ion PGM and paired-end Illumina
(MiSeq and HiSeq2000). SPANDx constructed almost 
identical core genome SNP phylogenies with these two 
datasets (Figure 4) despite being generated from platforms 
with inherently different error profiles and chemistries. In 
addition, P/A determination across these 19 genomes was 
essentially identical with these two platforms (Figure 4). 
These data demonstrate the robustness and accuracy of
SPANDx across multiple NGS platforms. 

Discussion 

SPANDx is a simple-to-use, high-throughput, and open 
source comparative genomics tool that has been devel- 
oped for the integrated analysis of haploid WGS data from 
start to finish with minimal hands-on time. SPANDx has 
been written to handle multiple NGS platforms and currently
can analyse single- and paired-end read data from
the Illumina MiSeq/HiSeq/GAIIx platforms, and single-end
data from the Ion PGM and 454 GS FLX/FLX+ platforms.
Because SPANDx uses PBS resource management,
it has the capability of performing both single-core and 
parallel task processing, resulting in rapid turn-around- 
time, especially for medium- to large-scale WGS datasets 
comprising one, ten or even thousands of genomes. 

SPANDx integrates existing, freely available comparative 
WGS analysis tools (BWA, Picard, the GATK, SAMtools,
SnpEff, BEDTools and VCFtools) into a single pipeline.
Importantly, SPANDx incorporates novel features for 
comprehensive analysis of raw haploid WGS data, and is 
aimed at simplifying downstream analysis (Figure 1) and 
increasing the user friendliness of data outputs. First, 
SPANDx automatically constructs P/A matrices of genetic 
loci using raw outputs generated by the coverageBED 
module of BEDTools. This feature enables identification 
of the core genome, a common goal of comparative hap- 
loid genome analyses. We have used this tool to design 
highly accurate species-specific assays for B. ubonensis 
[13], H. influenzae and Haemophilus haemolyticus (Price 
et al., manuscript in prep.), based on the identification of
highly conserved loci that are absent in other species. Sec- 
ond, SPANDx can construct annotated, merged SNP and 
indel matrices from .vcf outputs. When a SNP matrix is 
generated, SPANDx will generate PAUP*, PHYLIP or 
RAxML-compatible outputs for downstream phylogenetic 
analysis (e.g. Figures 2, 3 and 4). Third, SPANDx contains 
pre-optimised yet customisable variant calling parameters 
for Illumina and Ion PGM data by default, allowing users 
to run analyses without spending a large amount of time 
optimising these parameters. These novel features of 
SPANDx enable users to quickly compare genomic data 
outputs without cumbersome and time-consuming ma- 
nipulation of variant outputs. 

Existing open-source comparative genomic tools for 
haploid NGS data analysis include Galaxy and BRESEQ. 
Galaxy (http://galaxyproject.org/) is a popular NGS tool 
that does not require any knowledge of Linux. The web- 
based version of Galaxy is particularly useful for small- 
scale analyses. Other advantages of Galaxy include its 
standardised outputs, frequent developer updates, cloud- 
based computer resource availability, and the ability to 
install the program locally where data privacy is of
concern. The main limitation of Galaxy is the hands-on
time required to construct an analysis pipeline, especially 
the need to manually optimise the filtering and data pro- 
cessing steps. 

BRESEQ [19] is a command line tool implemented in 
C++ and R that is useful for finding variants (SNPs, 
indels, large deletions and new junctions supported by 
mosaic reads) relative to a closely related reference gen- 
ome. Comparison of BRESEQ and SPANDx outputs in 
the current study demonstrated that both programs gave 
almost identical SNP and indel outputs, suggesting that 
both tools excel for this purpose. However, less consen- 
sus was found when identifying large deletion boundar- 
ies, with SPANDx overestimating deleted regions in 3/5 
cases due to paralogous IS element loci flanking these 
regions, which cannot be mapped with short-read NGS 
data. BRESEQ has an additional advantage over SPANDx 
in its ability to identify larger (>~20 bp) insertions, as
SPANDx is not currently configured for this purpose. 
However, unlike SPANDx, BRESEQ is not appropriate 
for WGS analysis of more distantly related genomes or 
for medium- to large-scale datasets. Due to its lack of 
parallel processing, users of BRESEQ are limited to a ref- 
erence genome of <20 Mb, an average genome coverage 
of <20x, and < 1,000 expected mutations, and many 
comparative genomic functions are yet to be incorpo- 
rated into its pipeline. BRESEQ also requires consider- 
ably more hands-on time to merge variant files than 
SPANDx and is thus not practical to use for more than a 
handful of genomes. 

SPANDx has other advantages over existing tools and 
pipelines, including error-corrected SNP and indel matri- 
ces. To minimise effort and to standardise outputs across 
studies, SPANDx variant calling parameters have been 
optimised on our bacterial NGS datasets but can be customised
to the user's preference. Using default settings, we
have demonstrated that SPANDx performs comparably 
for SNP calling across Illumina MiSeq/HiSeq2000- and 
Ion PGM-generated data. To the best of our knowledge, 
other pipelines have not been tested and validated across 
multiple NGS platforms. 

Recognised shortcomings of SPANDx include the inabil- 
ity to identify SNP variation in paralogous regions, or in- 
versions, although these issues were also identified in 
BRESEQ and are the result of NGS data and not an inher- 
ent shortcoming of these programs. Currently, SPANDx 
requires PBS to perform parallel processing and cannot be 
run on systems that do not possess this software. To increase
the utility of SPANDx, future versions will include
the ability to run this pipeline with multiple resource han- 
dlers. Although SPANDx uses BEDTools for identifying 
large deletions, this program does not accurately pinpoint 
the exact positions of large deletions and further analysis 
is needed. SPANDx currently does not contain tools for 
identifying large insertions. For those wishing to identify
chromosomal rearrangements or large (>20 bp) insertions, 
or to accurately characterise large deletions, it is recom- 
mended that genome assemblies are used instead of 
SPANDx (or similar programs). 

Conclusion 

The NGS era has enabled researchers to generate unpre- 
cedented amounts of genomic data, but there remains a 
bottleneck in analysis. Genomic analysis pipelines such 
as SPANDx provide a streamlined way of decoding these 
data without the requirement for researchers to "re- 
invent the wheel" or learn multiple NGS programs. 
SPANDx is currently written to handle only haploid re- 
sequencing datasets. However, future development of 
SPANDx will include the ability to use other resource 
handlers (e.g. SGE), de novo assembly of accessory gen- 
ome components from unaligned reads, de novo and 
reference-assisted genome assemblies, tools for insertion 
and chromosomal rearrangement detection and the ability
to analyse diploid NGS data (e.g. the human genome).

Availability and requirements 

Project name: SPANDx 

Project homepage: https://sourceforge.net/projects/spandx/

Operating system: Linux 
Programming language: Bash 

Other requirements: Portable Batch System (TORQUE
2.5.13), Java 1.7.0_55, Burrows-Wheeler Aligner (BWA)
0.6.2, SAMtools 0.1.19, BEDTools 2.18.2, Picard 1.105,
the Genome Analysis Tool Kit (GATK) 3.0 or higher,
VCFtools 0.1.11, tabix 0.2.6 and SnpEff 3.6.

License: GNU General Public License version 3.0 (GPLv3)

Any restrictions to use by non-academics: Yes. Com- 
mercial users of GATK are required to obtain a licence 
for use. For further information, see www.appistry.com/gatk.
As of version 3.1, GATK is open source to not-for-profit
institutions only. SPANDx and all other software
used by SPANDx are open source. 

Availability of supporting data 

Two NGS datasets were used in this study. The first dataset 
comprised 16 publicly available E. coli Illumina HiSeq2000-generated
genomes (Sequence Read Archive [SRA] accessions
ERX287459, ERX287470, ERX287479, ERX287533,
ERX287535 through ERX287538, ERX287540, SRX012986,
and SRX012988 through SRX012993). Seven are isogenic
'REL' isolates from long-term evolution experiments
(http://myxo.css.msu.edu/ecoli/) that span ~40,000 in vitro
generations [19] and the additional nine are other more distantly
related E. coli genomes from the SRA database. The
FASTA file for the closed E. coli genome REL606 [19] was
used as the reference for variant calling and annotation.
The second dataset comprised 20 Australian H. influenzae
strains sequenced using both the Ion PGM [16] and Illumina
MiSeq [16] or HiSeq2000 platforms. The FASTA file
for the closed H. influenzae 86-028NP genome [21] was
used as the reference for variant calling.